# Vision-Language Fusion
## DAM-3B Self-Contained

**Developer:** nvidia · **License:** Other · **Task:** Image-to-Text (English) · **Downloads:** 824 · **Likes:** 17

DAM-3B is a vision-language model that generates fine-grained, localized descriptions of user-specified image regions, which can be given as points, boxes, sketches, or masks.
## Gemma-3-4B-It-Abliterated (Q4_0 GGUF)

**Developer:** BernTheCreator · **Task:** Image-to-Text · **Downloads:** 160 · **Likes:** 1

A GGUF-format conversion of mlabonne/gemma-3-4b-it-abliterated, combined with the vision component of x-ray_alpha for a smoother multimodal experience.
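GGUF checkpoints like this one are typically run with llama.cpp or its Python bindings. Below is a minimal text-only sketch using llama-cpp-python; the local file name is hypothetical, and the vision path additionally requires the model's mmproj projector file, which this example omits.

```python
# Minimal text-only sketch using llama-cpp-python (pip install llama-cpp-python).
# The GGUF file name is hypothetical; download the actual file from the model page.
from llama_cpp import Llama

llm = Llama(
    model_path="gemma-3-4b-it-abliterated-Q4_0.gguf",  # hypothetical local path
    n_ctx=4096,       # context window
    n_gpu_layers=-1,  # offload all layers to GPU if available
)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Summarize what a GGUF file is."}]
)
print(out["choices"][0]["message"]["content"])
```

Q4_0 is a 4-bit quantization scheme, which is what makes a 4B multimodal model practical on CPU or small-GPU machines.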
## Diagram-to-Code Agent

**Developer:** DiagramAgent · **License:** Apache-2.0 · **Task:** Image-to-Text (English) · **Downloads:** 51 · **Likes:** 0

A vision-language fusion model designed specifically to convert diagrams into structured code.
## ColPali v1.3

**Developer:** vidore · **License:** MIT · **Task:** Text-to-Image (English) · **Downloads:** 96.60k · **Likes:** 40

ColPali is a visual retrieval model that combines a PaliGemma-3B backbone with the ColBERT late-interaction strategy to index documents efficiently from their visual features.
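ColBERT-style late interaction means queries and pages are each embedded as a bag of token-level vectors and scored with a MaxSim sum rather than a single dot product. A minimal retrieval sketch follows, assuming the colpali-engine package; the method names follow that project's published examples, and the file names are hypothetical.

```python
# Sketch of indexing and scoring with ColPali, assuming the colpali-engine
# package (pip install colpali-engine); method names follow its examples.
import torch
from PIL import Image
from colpali_engine.models import ColPali, ColPaliProcessor

model_name = "vidore/colpali-v1.3"
model = ColPali.from_pretrained(
    model_name, torch_dtype=torch.bfloat16, device_map="auto"
).eval()
processor = ColPaliProcessor.from_pretrained(model_name)

pages = [Image.open("page_001.png")]       # hypothetical scanned document pages
queries = ["quarterly revenue table"]

with torch.no_grad():
    page_emb = model(**processor.process_images(pages).to(model.device))
    query_emb = model(**processor.process_queries(queries).to(model.device))

# Late-interaction (MaxSim) scoring: one relevance score per (query, page) pair.
scores = processor.score_multi_vector(query_emb, page_emb)
print(scores)
```

Because pages are indexed directly as images, this avoids an OCR-plus-chunking pipeline entirely, which is the model's main selling point for document retrieval.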
## ChemVLM-8B

**Developer:** AI4Chem · **License:** Apache-2.0 · **Task:** Image-to-Text · **Framework:** Transformers · **Downloads:** 117 · **Likes:** 6

ChemVLM-8B is an 8-billion-parameter multimodal large language model specialized for the chemistry domain, capable of processing both textual and visual chemical information.
## ColPali

**Developer:** vidore · **License:** MIT · **Task:** Text-to-Image (English) · **Downloads:** 12.88k · **Likes:** 436

The base release of the ColPali visual retrieval model (see ColPali v1.3 above): a PaliGemma-3B backbone with the ColBERT strategy for efficient document indexing from visual features.
## MMAlaya

**Developer:** DataCanvas · **License:** Apache-2.0 · **Task:** Image-to-Text · **Framework:** Transformers · **Downloads:** 31 · **Likes:** 1

MMAlaya is a multimodal system built on the Alaya large language model, comprising three core components: the language model itself, an image-text feature encoder, and a feature transformation module.
## LLaVA-Plus v0 7B

**Developer:** LLaVA-VL · **Task:** Text-to-Image · **Framework:** Transformers · **Downloads:** 79 · **Likes:** 38

LLaVA-Plus is a large language and vision assistant that learns to use pluggable skills (external tools), intended primarily for academic research on multimodal models and chatbots.
## LLaVA v1.5 13B LoRA

**Developer:** liuhaotian · **Task:** Text-to-Image · **Framework:** Transformers · **Downloads:** 143 · **Likes:** 26

LLaVA is an open-source multimodal chatbot fine-tuned from LLaMA/Vicuna on GPT-generated multimodal instruction-following data; this checkpoint is the 13B LoRA variant.
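The listed checkpoint is a LoRA adapter for the original LLaVA training/inference codebase; the quickest way to try LLaVA 1.5 from Python is usually the transformers-ported llava-hf weights instead. A minimal inference sketch under that assumption:

```python
# Sketch of LLaVA 1.5 inference via transformers. Note: this loads the
# llava-hf port of the 13B weights, not the liuhaotian LoRA checkpoint,
# which targets the original LLaVA codebase.
import torch
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

model_id = "llava-hf/llava-1.5-13b-hf"
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

image = Image.open("photo.jpg")  # hypothetical local image
prompt = "USER: <image>\nDescribe this picture in one sentence. ASSISTANT:"
inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device)

output = model.generate(**inputs, max_new_tokens=80)
print(processor.decode(output[0], skip_special_tokens=True))
```

The `<image>` placeholder in the prompt marks where the processor splices in the vision tokens, following LLaVA 1.5's USER/ASSISTANT chat template.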